Software Complexity Research Program

Department of Defense (DOD) software production and maintenance
is a large, poorly understood, and inefficient process.
Recently Frost and Sullivan (The Military Software Market, 1977)
estimated the yearly cost for software within DOD to be as large
as $9 billion. DeRoze (1977) has also estimated that 115 major
defense systems depend on software for their success. In an
effort to find near-term solutions to software-related problems,
the DOD has begun to support research into the software production
process.

A formal 5-year R&D plan (Carlson & DeRoze, 1977) related
to the management and control of computer resources was recently
written in response to DOD Directive 5000.29. This plan requested
research leading to the identification and validation of metrics
for software quality. The study described in this paper represents
an experimental investigation of such metrics and is part
of a larger research program seeking to provide valuable
information about the psychological and human resource aspects of
the 5-year plan.

DOD is also initiating the development of a more powerful,
higher order language for general use by all services (Department
of Defense, 1977). With a language-independent measure of the
complexity of software, we can evaluate not only program A versus
program B, but also the individual constructs of a language (cf.
Gordon, 1977). Thus, an objective, quantitative theory based on
sound experimental data can replace idiosyncratic, subjective
evaluations of the psychological complexity of software. Long-term
benefits of this effort involve improved software system
reliability and reduced development and maintenance costs.




The challenge undertaken in this research program is to
quantify the psychological complexity of software. It is important
to distinguish clearly between the psychological and computational
complexity of software. Computational complexity refers
to characteristics of algorithms or programs which make their
proof of correctness difficult, lengthy, or impossible. For
example, as the number of distinct paths through a program
increases, the computational complexity also increases.
Psychological complexity refers to those characteristics of software
which make human understanding of software more difficult. No
direct linear relationship between computational and psychological
complexity is expected. A program with many control paths
may not be psychologically complex. Any regularity to the branching
process within a program may be used by a programmer to
simplify understanding of the program.

Halstead (1977) has recently developed a theory concerned
with the psychological aspects of computer programming. His
theory provides objective estimates of the effort and time
required to generate a program, the effort required to understand
a program, and the number of bugs in a particular program
(Fitzsimmons & Love, 1978). Some predictions of the theory
are counterintuitive and contradict some results of previous
psychological research. The theory has attracted attention
because independent tests of hypotheses derived from it have
proven amazingly accurate.

Although predictions of programmer behavior have been
particularly impressive, much of the research testing Halstead's
theory has been performed without sufficient experimental or
statistical controls. Further, much of the data were based upon
imprecise estimating techniques. Nevertheless, the available
evidence has been sufficient to justify a rigorous evaluation
of the theory.

Rather than initiate a research program designed specifically
to test the theory of software science, a research strategy was
chosen which would generate suggestions for improving programmer
efficiency regardless of the success of any particular theory.
This research has focused on four phases of the software life-cycle:
understanding, modification, debugging, and construction. Since
different cognitive processes are assumed to predominate in
each phase, no single experiment or set of experiments on a
particular phase would provide a sufficient basis for making broad
recommendations for improving programmer efficiency. Each
experiment in the series comprising this research program has been
designed to test important variables assumed to affect a particular
phase of software development. Professional programmers
have been used in these experiments to provide the greatest
possible external validity for the results (Campbell & Stanley,
1966). In addition, Halstead's theory of software science and
other related metrics have been evaluated with these data.

This report is the third in a series investigating
characteristics of software which are related to its psychological
complexity. Three independent variables, length of program,
complexity of control flow, and type of error, were evaluated
for three different Fortran programs in a debugging task.
Fifty-four experienced programmers were asked to locate a single
bug in each of three programs. Documentation consisted of
input files, correct output, and erroneous output. Performance
was measured by the time to locate and successfully correct the
bug.

Small but significant differences in time to locate the
bug were related to differences among programs and presentation
order. Although there was no main effect for type of bug, there
was a large program by error interaction suggesting the existence
of context effects. Among measures of software complexity,
Halstead's E proved to be the best predictor of performance,
followed by McCabe's v(G) and the number of lines of code.

Number of programming languages known and familiarity with
certain programming concepts also predicted performance. As in
the previous experiments, experiential factors were better
predictors for those participants with three or fewer years'
experience programming in Fortran.

INTRODUCTION

Debugging  programs  is  one  of  the  most  expensive,  time- 
consuming  activities  in  the  development  of  a software  system. 
Only  a few  laboratory  experiments  have  investigated  the 
relative  difficulty  of  locating  different  types  of  bugs  or 
the  most  effective  search  strategies.  Youngs  (1974)  found 
that  experience  contributed  to  differences  among  types  of 
errors  made  in  a construction  experiment.  Wescourt  and  Hemphill 
(1978)  described  a model  of  the  debugging  process,  but  the  model 
was  not  entirely  supported  by  the  available  data.  Gould  and 
his  associates  (Gould  and  Drongowski,  1974;  Gould,  1975)  found 
that  the  type  of  bug  influenced  debugging  performance  on  short 
programs.  Specifically,  assignment  bugs  were  more  difficult 
to  locate  than  array  or  iteration  bugs,  probably  because  the 
former  required  a greater  understanding  of  the  algorithm  used 
by  the  program. 

The difficulty of debugging a program may be associated
with coding practices used during its development. One factor
which may influence the ease of finding a bug is the complexity
of a program's control flow. Two previous experiments by the
authors investigated the effects of structured control flow in
understanding and modification tasks (Sheppard, Curtis, Borst,
Milliman, & Love, 1979). Programmers performed their tasks
more efficiently on code which exhibited a straightforward,
top-down control flow than on an unstructured, convoluted
control flow. A rigorously structured control flow (Dijkstra,
1972) did not produce significantly better performance than a
naturally structured version which allowed limited unstructured
constructs (e.g., exits from loops). Thus the overall top-down
quality of the control flow appears to influence performance,
while minor deviations from the tenets of structured code do
not appear to influence performance significantly. This result
may reflect the innate awkwardness of implementing strictly
structured code in standard Fortran.

Factors other than the structuredness of the control flow
may influence the complexity of a computer program and, thus,
the difficulty programmers experience in performing their tasks.
Some of these factors have been quantified in the software
complexity metrics developed by Halstead (1977) and McCabe (1976).
Halstead's metric purportedly represents the number of mental
discriminations involved in developing a program, while McCabe's
metric measures the number of elementary control path segments
comprising a program. In experiments on understanding and
modification, these software complexity metrics were evaluated for
their usefulness as predictors of programmer performance (Curtis,
Sheppard, Milliman, Borst, & Love, 1979). The results observed
in those experiments were modest. The correlations in the raw
data were not large, and the number of lines of code usually
predicted programmer performance better than the Halstead or
McCabe metrics. Several limitations in the experimental
procedures employed to obtain the data may have produced these
results. First, all of the programs studied were short (35-55
lines of code). The limited range of metric values calculated
on programs of this length may not have been sufficient for an
adequate test of the predictive worth of the metrics. Second,
individual differences among programmers exerted significant
effects on the results obtained. When the data from the first
experiment were transformed in an attempt to control for
differences among programs and programmers, a correlation of -.73
(p < .001) was obtained between the performance criterion and
Halstead's E. However, the issue is not whether theories can
be validated with mystical transformations of data, but whether
the results of these heuristic transformations can be replicated
in an experiment designed to overcome the limitations of previous
research.




The present experiment evaluated the difficulty of locating
three types of errors under controlled programming conditions.
In order to compare the effects on performance of different
methods of structuring code, programs in the present experiment
were implemented in three types of control flow, all of which
exhibited a generally top-down flow. This experiment also
evaluated the ability of software complexity metrics to predict
performance over a wider range of program sizes. To investigate
the effects of length, the three programs in this experiment
were subdivided into functional subroutines so that they could
be presented in three different lengths: approximately 50,
125, and 200 lines of code. Finally, the present experiment
attempted to relate programming performance to experiential
factors, such as familiarity with other programming languages
or relevant programming tools and concepts.


METHOD 


Participants 

Fifty-four professional programmers at six different
locations participated in this experiment. Thirty were civilian
employees, while 24 were employees of the military. The
participants averaged 6.6 years of professional experience
programming in Fortran, ranging from 1/2 year to 25 years (SD = 6.1).

Experimental  Design 

In order to control for individual differences in performance,
a within-subjects, 3^4 factorial design was employed. Three
types of control flow were defined for each of three programs,
and each of these nine versions was presented in three lengths
with three different bugs, for a total of 81 different
experimental conditions. The first 27 participants each saw three of
the programs, exhausting the 81 conditions (Figure 1). The
second set of 27 participants replicated the conditions exactly
except that the order of presentation of the tasks was different
in each case.
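
As a rough illustration of how the 81 conditions arise, the sketch below crosses the four three-level factors named in this report. The level labels are taken from the descriptions of the independent variables; the function name and the tuple representation are illustrative assumptions, and the sketch does not reproduce the actual assignment of conditions to participants shown in Figure 1.

```python
from itertools import product

# Factor levels as described in this report (labels abbreviated for illustration).
PROGRAMS = ("questionnaire sort", "accounting", "grade scoring")
CONTROL_FLOWS = ("naturally structured", "graph-structured", "Fortran 77")
LENGTHS = ("short", "medium", "long")
BUGS = ("computational", "logical", "data")


def enumerate_conditions():
    """Return every cell of the 3 x 3 x 3 x 3 factorial design as a tuple."""
    return list(product(PROGRAMS, CONTROL_FLOWS, LENGTHS, BUGS))


conditions = enumerate_conditions()
print(len(conditions))       # number of distinct experimental conditions
print(len(conditions) // 3)  # participants needed if each one sees three conditions
```

With three tasks per participant, 27 participants are enough to cover every cell once, which matches the two replications of 27 described above.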

Learning effects were expected on the basis of results
obtained in previous experiments of this type (Sheppard, Curtis,
Borst, Milliman, & Love, 1979; Sheppard & Love, 1977).
Therefore, the order of presentation of conditions was counterbalanced
to assure that each level of each independent variable appeared
as the first, second, or third task an equal number of times.

Procedure 

A packet of materials prepared for each participant
included: 1) written instructions on the experimental tasks
(Appendix A), 2) a short tutorial of commands used in Fortran
77 (Appendix A), 3) a short preliminary task (Appendix B),
4) three experimental tasks, and 5) a questionnaire concerning
previous experience (Appendix C).

All tasks included input files, a listing of the Fortran
program with the embedded bug, a correct output, and the
erroneous output produced by this program. All differences between
the correct and erroneous output were circled on the erroneous
output. Also included were explanatory descriptions of any
subroutines or functions not presented in the listing but
referenced by the program.

The 54 participants were divided into two groups of 27,
each of which represented a complete replication of the design.
Within a group all participants were given the same preliminary
task. Group 1 worked with an algorithm to find the greatest
common divisor of two numbers and Group 2 was given a simple
sort algorithm. These preliminary tasks were provided to reduce
learning effects on the experimental tasks and to provide a
basis for comparing the abilities of the participants to perform
a task of this nature.

Following the initial exercises, participants were presented
with three separate programs comprising their experimental
tasks. Participants were allowed to work at their own
pace, signalling the experimenter when they believed they had
identified and corrected the bug. The experimenter verified
all corrections, and in the case of a mistake the participant
was instructed to try again until the task was successfully
completed. The maximum time participants were allowed to work
on a particular program was 45 minutes for the preliminary
task and 60 minutes for each experimental task. Time was
measured to the nearest minute.

Independent Variables


Program. Three programs were selected for the generality
of their content and their understandability to programmers.
The first program sorted and categorized alphabetic response
data to a questionnaire (Veldman, 1967). The second program,
an accounting routine, produced income and balance statements
(Nolen, 1971). Program 3 kept track of students' test grades
and calculated their semester averages (Brooks, 1978). All
programs were tested prior to the experiment.

Length. The inclusion of additional subroutines made it
possible to present each program in three different lengths.
The shorter programs had 25-75 statements, medium programs
contained 100-150 statements, and the longer programs contained
approximately 175-225 statements. (One Fortran 77 version
exceeded the 225 line limit by 8 lines because of the number of
ELSE and ENDIF statements required.)

Program  listings  included  a two  or  three  line  explanation 
of  any  routine  or  function  that  was  called  by  a program  but  not 
presented  in  the  experimental  materials.  Participants  were 
told  to  assume  that  missing  routines  worked  correctly.  All  of 
the  input  and  output  files  were  presented  regardless  of  the 
length  of  the  program.  That  is,  for  the  shorter  version,  some 
of  the  input  was  read  in  and  some  of  the  output  was  produced 
by  subroutines  which  were  not  presented. 

Complexity of control flow. Three versions of control
flow performing identical tasks were defined for each program.
Two types of structures were implemented in Fortran IV, naturally
structured and graph-structured. A third version was written
in Fortran 77 (Brainerd, 1978), which includes the IF-THEN-ELSE,
DO-WHILE, and DO-UNTIL constructs.

The Fortran 77 version of each program was implemented in
a precisely structured manner. All flow proceeded from top to
bottom, and only three basic control constructs were allowed:
the linear sequence, structured selection, and structured
iteration (Figure 2).

The graph-structured version of each program was implemented
in Fortran IV from the Fortran 77 version, replacing the special
constructs but producing code for which the control flow graphs
of the two versions were identical. All nested relationships
could be reduced through structured decomposition to a linear
sequence of unit complexity. A full discussion of reducibility
is presented by McCabe (1976).

Structured  constructs  are  awkward  to  implement  in  Fortran 
IV  (Tenny,  1974).  In  order  to  test  a more  naturally  structured 
flow,  limited  deviations  were  allowed  in  a third  version  of 
each  program.  These  deviations  included  such  practices  as 
branching  into  or  out  of  a loop  or  decision  and  multiple  returns. 
Control  flow  graphs  and  the  code  for  a section  of  a routine 
implemented  in  all  three  versions  of  control  flow  are  presented 
in  Figures  3 and  4. 

Each program was indented following the nesting patterns
presented in the code. Thus, all DO loops and branching
instructions were indented. For naturally structured versions,
decisions were made arbitrarily about the importance of various
constructions, and indenting was necessarily less standardized
than for the graph-structured and Fortran 77 versions.

Type  of  Bug.  Three  types  of  semantic  bugs  were  chosen 
from  a classification  developed  by  Hecht,  Sturm,  and  Trattner 
(1978):  computational,  logical,  and  data  errors.  Bugs  in  each 
category  were  defined  for  each  of  the  three  programs  in  order 
to  maximize  the  similarity  of  bugs  from  a single  category  across 
programs.  Computational  bugs  involved  a sign  change  in  an 
arithmetic  expression.  Logic  bugs  were  implemented  by  using  the 
wrong logical operator in an IF condition. Data bugs involved
wrong index values for variables. Examples of these bugs and
the routines in which they were inserted are presented in
Appendix D.

[Figure 2. The Basic Structured Constructs. Panels: sequence,
selection (IF-THEN-ELSE), and iteration (DO WHILE, DO UNTIL).]

[Figure 4. Examples of the Three Types of Control Flow. Panels:
naturally structured, graph-structured, and Fortran 77 listings
of the same section of a routine.]


Each  bug  in  this  experiment  was  purposely  designed  to 
affect  only  a limited  area  of  code.  That  is,  each  calculation 
containing  a bug  occurred  near  the  corresponding  WRITE  and 
FORMAT  statements.  In  no  case  did  a bug  produce  errors  in 
routines  other  than  the  one  in  which  it  was  embedded,  and 
each  bug  appeared  in  only  one  line  of  code. 

Individual  Differences  Measures 

Scores on the preliminary exercise were used as a measure
of programming ability related to the experimental task.
Participants were also asked to complete a questionnaire about
their programming experience. The information requested
included specific type of experience, number of years programming
professionally in Fortran, number of statements in the longest
Fortran and non-Fortran programs written, the first programming
language learned, and number of languages learned. In addition,
various programming concepts that appeared relevant to the
experimental programs were listed, and participants were asked
to mark those with which they were familiar.

Complexity  Metrics 

Halstead's E. Using a program based on Ottenstein (1976),
Halstead's effort metric (E) was computed from the source code
listings of the 27 experimental programs, representing three
distinct programs at three levels of structure and three
different lengths. The computational formula was:

$$E = \frac{n_1 N_2 (N_1 + N_2) \log_2 (n_1 + n_2)}{2 n_2}$$

where

$n_1$ = number of unique operators
$n_2$ = number of unique operands
$N_1$ = total frequency of operators
$N_2$ = total frequency of operands
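
To make the computational formula concrete, the following minimal sketch evaluates E from the four counts defined above. It is not the Ottenstein-based program used in the study; the function name and the example counts are hypothetical, and the counting of operators and operands in Fortran source is assumed to have been done elsewhere.

```python
import math


def halstead_effort(n1, n2, N1, N2):
    """Compute Halstead's E from operator and operand counts.

    n1: number of unique operators
    n2: number of unique operands
    N1: total frequency of operators
    N2: total frequency of operands
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("n1 and n2 must be positive")
    return n1 * N2 * (N1 + N2) * math.log2(n1 + n2) / (2 * n2)


# Hypothetical counts chosen only so the arithmetic is easy to follow:
# 10 * 20 * (30 + 20) * log2(16) / (2 * 6) = 40000 / 12
print(halstead_effort(n1=10, n2=6, N1=30, N2=20))
```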

McCabe's v(G). McCabe's metric is the classical graph-theory
cyclomatic number defined as:

$$v(G) = \#\,\text{edges} - \#\,\text{nodes} + 2\,(\#\,\text{connected components})$$

McCabe presents two simpler methods of calculating v(G):
the number of predicate nodes plus 1, or the number of regions
computed from a planar graph of the control flow.
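
The two arithmetic routes to v(G) described above can be sketched as follows. The function names are illustrative assumptions, and the example graph is a single IF-THEN-ELSE (a decision node, two branch nodes, and a join node), not one of the experimental programs.

```python
def cyclomatic_from_graph(num_edges, num_nodes, num_components=1):
    """v(G) = edges - nodes + 2 * (connected components)."""
    return num_edges - num_nodes + 2 * num_components


def cyclomatic_from_predicates(num_predicates):
    """v(G) for a structured program: predicate nodes plus 1."""
    return num_predicates + 1


# One IF-THEN-ELSE: 4 nodes (decision, then, else, join) and 4 edges.
print(cyclomatic_from_graph(num_edges=4, num_nodes=4))  # 2
print(cyclomatic_from_predicates(num_predicates=1))     # 2
```

Both routes agree on this small graph, which is the property that makes the predicate count a convenient shortcut.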

Length. The length of the program was the total number
of Fortran statements, excluding comments. The total number
of executable statements was found to be highly correlated
with number of statements (r = .99, p ≤ .001).

Dependent  Variable 

The  dependent  variable  was  the  number  of  minutes  necessary 
for  the  participant  to  locate  and  correct  the  bug. 

Analysis 

The analysis of data was conducted in two phases. The
first phase was an experimental test of the independent variables,
while the second phase evaluated the software complexity metrics.
In the first phase, experimental data were analyzed in a
hierarchical regression analysis. In this analysis, domains
of variables were entered sequentially into a multiple
regression equation to determine if each successive domain
significantly improved the predictive capability of the equation
developed from domains already entered. Thus, the order in which
domains were entered into the analysis was important. Variables
representing the different conditions of experimentally manipulated
variables were effect-coded (Kerlinger & Pedhazur, 1973).
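
The report does not list the coded vectors, so the sketch below shows only the conventional effect-coding scheme: a factor with k levels becomes k - 1 columns, each of the first k - 1 levels receives a 1 in its own column, and the remaining level receives -1 in every column. Treating the last listed level as the -1 level, and the function name itself, are assumptions made for illustration.

```python
def effect_code(levels, observations):
    """Return k-1 effect-coded columns for each observation of a k-level factor.

    The last entry of `levels` is coded -1 in every column (an assumed convention).
    """
    if len(levels) < 2:
        raise ValueError("a factor needs at least two levels")
    reference = levels[-1]
    width = len(levels) - 1
    rows = []
    for obs in observations:
        if obs == reference:
            rows.append([-1] * width)
        else:
            position = levels.index(obs)  # raises ValueError for unknown levels
            rows.append([1 if col == position else 0 for col in range(width)])
    return rows


bug_types = ["computational", "logical", "data"]
for row in effect_code(bug_types, ["computational", "logical", "data"]):
    print(row)
```

A three-level factor coded this way contributes two columns to the regression, which corresponds to the 2 degrees of freedom per main effect in the analysis.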

The second phase of analysis investigated relationships
between the time to find the bug and the metrics, Halstead's
E, McCabe's v(G), and number of statements in the program.
All correlations are Pearson product-moment correlations.
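
For readers who want the computation spelled out, a Pearson product-moment correlation can be sketched in a few lines. The function name and the toy data are hypothetical; none of the numbers below come from the experiment.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Toy values standing in for a complexity metric and minutes to find a bug.
metric_values = [10, 20, 30, 40]
minutes = [12, 15, 21, 24]
print(round(pearson_r(metric_values, minutes), 3))
```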




RESULTS 

Preliminary  Tasks 

Group 1 (Participants 1-27) and Group 2 (Participants
28-54) were given different preliminary tasks. The two algorithms
were of varying difficulty, producing significant differences
in both time to completion and percent of completions. Finding
the bug in the greatest common divisor algorithm required an
average of 23.8 minutes with 22% failing to find the bug in 45
minutes, while the sorting algorithm required only 14.6 minutes
with only 4% failing to find the bug. However, no significant
differences in performance between the two groups occurred on
the experimental programs.

Experimental  Manipulations 

The average time to locate bugs across all experimental
conditions was 20.1 minutes (SD = 16.2). All but six of the
162 experimental tasks comprising this experiment were completed
successfully during the allotted 60 minutes. These six
conditions were not associated with any particular factor.

Despite the use of a preliminary task to familiarize the
participants with the experiment, a significant order effect
occurred (p ≤ .04), indicating that learning took place during
the first of the three experimental tasks (Figure 5).

Results of a hierarchical regression analysis of the
independent variables on the time to find the bug are presented
in Table 1. Differences in solution time for the three programs
were significant (p ≤ .01). Finding the bug in the accounting
program required an average of 15.1 minutes, compared with 20.0
minutes in the program that sorted questionnaire data and 25.0
minutes in the grade-scoring program. Increasing the length of
the programs had a modest effect (p ≤ .06) on the time to locate
and correct the error. The average time for the short program was
16 minutes, while the medium and long programs required a mean of
21 and 23 minutes, respectively.

[Figure 5. Order Effect on the Three Experimental Tasks. Axes:
order of presentation (1, 2, 3) versus minutes to find bug.]

TABLE 1

Hierarchical Regression Analysis
for Time to Find Bug

Variable                             df     R²        ΔR²
(1) Program                           2     .06**     .06**
(2) Presentation order                2     .04*      .04*
(3) Type of bug                       2     .00       .00
(4) Program x bug interaction         4     .26***    .26***
(5) Complexity of control flow        2     .02       .02
All variables                        12     .38***

Note: n = 162. R² column represents the separate regression
for each domain.

  *p < .05
 **p < .01
***p < .001


Averages for the three error categories were not
significantly different from one another. However, a very large
interaction occurred between type of bug and program (Figure
6). This interaction accounted for the largest percent of
variance (26%) of any of the experimental relationships studied.

No  significant  differences  in  performance  resulted  from  the 
three  types  of  control  flow. 

Software  Complexity  Metrics 

Intercorrelations among the three measures of software
complexity were computed from the 27 different versions of the
programs at both the subroutine and program levels (Table 2).
Substantial intercorrelations were observed among Halstead's E,
McCabe's v(G), and length at the subroutine level. When
computed on the total program, the correlation between length and
McCabe's v(G) increased, while the correlations for Halstead's E
with these two measures were substantially smaller, especially
with lines of code.

Correlations between time to find the bug and the
complexity metrics were calculated for unaggregated data (three
experimental tasks for each of the 54 participants, n = 162)
and for data averaged over the six scores obtained for each
of the 27 program versions (Table 3). Correlations for the
aggregated data were much higher than those for the unaggregated
scores. All three metrics predicted performance equally well at
the subroutine level. At the program level, however, E was the
best predictor, accounting for more than twice as much variance
in performance as length (56% versus 27%, respectively). The
variance accounted for by v(G) fell between these values (42%).
A stepwise multiple regression analysis indicated that length
and v(G) added no increments to the prediction afforded by E.

TABLE 2

Intercorrelations among Complexity Metrics

                          Correlations
Metrics                E          v(G)
Subroutine:
  v(G)                 .92***
  Length               .89***     .81***
Program:
  v(G)                 .76***
  Length               .56***     .90***

TABLE 3

Correlation Between Performance Time
and Complexity Metrics

                              Correlations
Metric                 Unaggregated     Aggregated
                       (n = 162)        (n = 27)
Subroutine:
  Halstead's E         .25***           .66***
  McCabe's v(G)        .24***           .63***
  Length               .25***           .67***
Program:
  Halstead's E         .23***           .75***
  McCabe's v(G)        .25***           .65***
  Length               .20**            .52**
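
The percentages of variance quoted for the program-level metrics follow from squaring the aggregated correlations reported in Table 3. The short sketch below reproduces that arithmetic; the function name is an illustrative choice, and the correlations are the program-level aggregated values from the table.

```python
def variance_explained(r):
    """Proportion of variance accounted for by a correlation of r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Aggregated program-level correlations with time to find the bug (Table 3).
aggregated_program_level = {
    "Halstead's E": 0.75,
    "McCabe's v(G)": 0.65,
    "Length": 0.52,
}

for metric, r in aggregated_program_level.items():
    print(f"{metric}: {variance_explained(r):.0%}")
```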

The scatterplot of performance with Halstead's E presented
in Figure 7 suggested the existence of a curvilinear trend in
the data. The significance of this trend was tested using the
second-degree polynomial regression approach suggested by both
Cohen and Cohen (1975) and Kerlinger and Pedhazur (1973) for
investigating curvilinear relationships. A multiple correlation
coefficient of .84 indicated that the curvilinear trend accounted
for an additional 15% (p ≤ .001) of the variance beyond that
accounted for by a linear relationship. The prediction equation
generated from these data was:

   minutes to find bug = 9.837 + .00239E - .00000000079E^2

However, with few data points in the right tail of the
distribution for Halstead's E, it is difficult to determine the
exact shape of the curvilinear trend. No curvilinear trend
was detected with either lines of code or McCabe's v(G).
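The second-degree polynomial approach can be sketched as a comparison of two nested least-squares fits: a linear model and a model that adds a squared term, with the increment in R-squared tested by an F ratio. The sketch below is a minimal illustration under stated assumptions. The data points are hypothetical (they are not the experimental data), the predictor is expressed in thousands for numerical convenience, and all function names are ours.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def polynomial_fit(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve_linear_system(a, b)


def r_squared(xs, ys, coeffs):
    """Proportion of variance in ys reproduced by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum(
        (y - sum(c * x ** p for p, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )
    return 1.0 - ss_resid / ss_total


# Hypothetical data: effort in thousands, minutes to find the bug.
effort = [10, 20, 40, 60, 80, 100, 150, 200]
minutes = [12, 16, 21, 26, 28, 31, 33, 32]

linear = polynomial_fit(effort, minutes, 1)
quadratic = polynomial_fit(effort, minutes, 2)
r2_linear = r_squared(effort, minutes, linear)
r2_quadratic = r_squared(effort, minutes, quadratic)

# F ratio for the increment due to the squared term (1 and n - 3 df).
n = len(effort)
increment = r2_quadratic - r2_linear
f_ratio = increment / ((1.0 - r2_quadratic) / (n - 3))

print(f"R2 linear = {r2_linear:.3f}, R2 quadratic = {r2_quadratic:.3f}")
print(f"increment = {increment:.3f}, F(1, {n - 3}) = {f_ratio:.2f}")
```

A negative coefficient on the squared term, as in the prediction equation above, corresponds to a curve whose slope flattens as E grows.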

Experiential  Factors 

The relationship between complexity metrics and performance
was investigated within groups of programmers differing in years
of professional experience programming in Fortran. As a
heuristic, the participants were divided into two groups of
approximately equal numbers: those with three or fewer years of
experience and those with more than three years of experience.
The results presented in Table 4 indicate that the complexity
measures were more predictive of performance for less experienced
programmers, especially when computed at the subroutine level.
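The moderator analysis described above amounts to computing the same correlation separately within each experience group. The following sketch shows the mechanics only. The task records are hypothetical and were chosen so that the two groups differ; the three-year cutoff comes from the text, while the names are ours.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical records: (years of Fortran experience, complexity score, minutes).
tasks = [
    (1, 10, 12), (2, 20, 18), (3, 30, 33), (2, 40, 41), (1, 50, 52),
    (5, 10, 20), (8, 20, 15), (4, 30, 22), (10, 40, 14), (6, 50, 21),
]


def correlation_by_experience(records, cutoff_years=3):
    """Correlate complexity with time separately at or below and above the cutoff."""
    less = [(c, m) for years, c, m in records if years <= cutoff_years]
    more = [(c, m) for years, c, m in records if years > cutoff_years]
    r_less = pearson_r([c for c, _ in less], [m for _, m in less])
    r_more = pearson_r([c for c, _ in more], [m for _, m in more])
    return r_less, r_more


r_less, r_more = correlation_by_experience(tasks)
print(f"3 or fewer years: r = {r_less:.2f}")
print(f"more than 3 years: r = {r_more:.2f}")
```

Because each group is analyzed at the level of individual tasks, the group sizes are counts of tasks rather than counts of participants.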

Two  measures  of  experience  were  also  found  to  be  related 
to  the  performance  of  less  experienced  programmers  (Table  5), 
but  not  to  the  performance  of  experienced  programmers.  The 
first  such  measure  was  the  number  of  programming  languages  the 
Figure 7. Scatterplot of Halstead's E and Performance
[Axes: time to locate versus Halstead's E (50K to 200K)]
TABLE 4

Correlations between Performance and Complexity Metrics
Moderated by Years of Fortran Experience

                           Correlations
                     ≤3 years      >3 years
Metrics              (n = 75)      (n = 87)

Subroutines:
  Halstead's E       .39***        .11
  McCabe's v(G)      .37***        .07
  Length             .33***        .17

Program:
  Halstead's E       .38***        .20*
  McCabe's v(G)      .29***        .21*
  Length             .18           .22*

Note: Dividing the data into groups of programmers required
      that scores be analyzed on individual tasks rather than
      on tasks averaged by program. Thus, this analysis was
      performed on the 75 experimental tasks performed by the
      25 participants with 3 or fewer years of Fortran
      experience and the 87 tasks performed by the 29
      participants with more than 3 years of experience.

  *p < .05
 **p < .01
***p < .001
TABLE 5

Relationships of Experiential Factors to Performance
for Programmers Differing in Fortran Experience

                              ≤3 years    >3 years    Total
Relevant experience           (n = 25)    (n = 29)    (n = 54)

# of programming languages    -.49**      -.03        -.19
Questionnaire score           -.48**      -.11        -.33**


participant knew. The second measure was the number of items
checked on the experience questionnaire (Appendix C). The
moderating effects of programmer experience may have been the
result of greater variability in performance for programmers
with less experience (Figure 8). This greater variability would
increase the ability of correlational tests to detect significant
relationships (Cohen & Cohen, 1975).


Figure 8. Scatterplot of Experience and Performance
[Axis: time to locate error (minutes)]
DISCUSSION


Four factors were found to influence the speed with which
programmers could find a bug in a computer program. These
factors were order of presentation, specific program, a program
by error interaction, and the complexity of the code as measured
by software complexity metrics. Type of bug and type of control
flow, however, did not account for a significant proportion of
the variation in performance.

Variance in programmer performance associated with
differences among the programs replicated results from two
previous experiments in this series (Sheppard et al., 1979).
However, a much larger percentage of the variance in performance
was accounted for by a program by error interaction. It appeared
that some quality of the algorithm in which the bug was embedded
influenced a programmer's ability to locate it. The time required
to detect similar errors contained in similar statements depended
on the program in which the error was embedded. This result
has implications for the usefulness of various schemes for
categorizing software bugs. The implied value of these taxonomies
is to identify properties of bugs which suggest how they are
created or how difficult they are to detect. Simple taxonomies
based on syntactic relationships will probably not prove
sufficient for this purpose. The results of this experiment
suggest that the detectability of a bug depends on the context of
the algorithm surrounding it. This contextual effect may determine
the optimal search strategy for finding the bug, and it is this
search strategy that needs to be understood if debugging
performance is to be improved.

In the last section of the post-session questionnaire, the
participants were asked to describe their searching strategies
for locating the bugs. Typically, one of two approaches was
described. In the first strategy the programmer tried to
understand the whole program from beginning to end before
searching for the section with the bug. In the second strategy
the programmer used appropriate clues in the output to go
directly to the section containing the bug. The latter appeared
to be a much quicker strategy for debugging, but there were
insufficient data for a meaningful statistical analysis. In order
to improve the debugging performance of programmers it will be
important not only to identify effective search strategies, but
also to identify conditions under which they will be
differentially effective.

No  significant  differences  were  evident  among  the  three 
types  of  top-down  control  flow  tested  in  this  experiment.  This 
finding  agrees  with  previous  results  (Sheppard  et  al.,  1979) 
where  differences  were  found  between  top-down  and  convoluted 
control  flow,  but  not  between  types  of  top-down  control  flow. 

The minor deviations from strictly structured coding allowed
in the naturally structured version of this experiment did not
adversely affect performance. Summarizing the combined results
of the three experiments, it would appear that the overall
top-down quality of the control flow is important to performance,
but careful attention to strict structuring does not appear to
improve programmer performance significantly.

Since no difference was found between the graph-structured
and Fortran 77 program versions, it would appear that the newer
constructs provide little additional aid in a debugging task
beyond that provided by a top-down flow. Only five of the 54
participants had previously used Fortran 77, so a lack of
familiarity with the new constructs may have prevented them
from finding the bug more quickly in Fortran 77 than in Fortran
IV. However, immediately prior to the experiment a short
training session was conducted with each group of participants in
which the new Fortran 77 constructs were discussed in detail.
These constructs were similar to those implemented in Fortran
IV, and the participants' previous lack of familiarity with
them was probably not a significant factor in their performance.

Most laboratory studies exhibit a certain degree of
artificiality that is necessary for experimental control. In this
experiment participants were told there was only one bug in
a program. While this situation differs from a normal
programming environment, it should not have affected the
participants' ability to perform the tasks. These experimental
tasks may have been simpler to perform than typical debugging
problems since there was greater certainty about the bugs.
Further, differences between the correct and erroneous output
were clearly marked on the erroneous output, reducing the amount
of comparison necessary to discover what problems had occurred.

During a typical debugging problem a programmer could
refer to the functional specifications for a program or to
comments included in the code. However, no such aids were
made available in this experiment. The participant's
comprehension of the program's function had to be gleaned from
the code or from the input and output listings. The latter were
designed to be self-explanatory, with each section labeled
appropriately; e.g., "FINAL COURSE GRADE" or "TRIAL BALANCE".
Although adding some artificiality to the experimental situation,
the absence of documentation was an attempt to equalize the
amount of information provided by materials other than the code.

Software  Complexity  Metrics 

The results of this experiment not only replicated the
results obtained in our previous research, but also demonstrated
that more viable results could be obtained when limitations in
our earlier experimental procedures were overcome. For instance,
our previous research was conducted exclusively on small-sized
(35-55 lines of code) programs, which seems to have limited
the results in three ways. First, the range of values on the
factors studied in those programs seems to have been too
restricted to detect the size of relationships observed here.
Second, the curvilinear relationship observed in this
experiment between Halstead's E and performance would not have
been observed if longer programs had not been used in the
experimental tasks. Third, the extremely high intercorrelation
between length and Halstead's E at the subroutine level suggests
that both are measuring program volume. With larger programs the
information measured appears to differ; that is, Halstead's E
measures something in addition to, but inclusive of, factors
measured by length.

Many small-sized programs can be grasped by the typical
programmer as a cognitive gestalt. The psychological complexity
of such programs is adequately represented by the volume of the
program in terms of the number of lines of code. When the code
grows beyond a subroutine, its complexity to the programmer
is better assessed by measuring constructs other than the
number of lines of code. This may be partly because programmers
cannot grasp the entire program within their mental spans at
a single time. For larger programs the difficulty programmers
experience is better represented by counts of operators, operands,
and control paths. Thus, as the size of a program increases,
Halstead's E seems to be a better measure of its psychological
complexity.
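To make the counts of operators, operands, and control paths concrete, the sketch below computes Halstead's E and McCabe's v(G) from such counts. It assumes the commonly cited definitions (volume divided by the estimated program level for E, and edges minus nodes plus twice the number of connected components for v(G)); the counting rules actually applied to the experimental programs are described earlier in this report and may differ in detail. The counts in the example are hypothetical.

```python
import math


def halstead_effort(n1, n2, N1, N2):
    """Halstead's E from operator and operand counts.

    n1, n2: number of distinct operators and distinct operands.
    N1, N2: total occurrences of operators and of operands.
    """
    vocabulary = n1 + n2
    length = N1 + N2
    volume = length * math.log2(vocabulary)
    # Estimated program level (assumed standard estimator).
    estimated_level = (2.0 / n1) * (n2 / N2)
    return volume / estimated_level


def cyclomatic_complexity(edges, nodes, components=1):
    """McCabe's v(G) for a control-flow graph."""
    return edges - nodes + 2 * components


# Hypothetical counts for a small routine.
effort = halstead_effort(n1=10, n2=6, N1=30, N2=20)
v_g = cyclomatic_complexity(edges=9, nodes=8)

print(f"Halstead's E = {effort:.1f}")
print(f"McCabe's v(G) = {v_g}")
```

Length enters E only through the total operator and operand counts, which is one way to see why E and lines of code can agree closely on short subroutines yet diverge on larger programs.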

One possible explanation for the superior predictive
ability of Halstead's E is that the relationship between program
size and performance is curvilinear, and the algorithmic
transformation with the Halstead measure captures this
relationship while lines of code does not. There was no evidence
in these data of a curvilinear relationship between lines of code
and performance. On the other hand, a curvilinear relationship
did exist between Halstead's E and performance. This trend
suggests that as Halstead's E grows larger, a program becomes
more psychologically complex, but the increments in difficulty
grow smaller and smaller. In the experimental task used in
this debugging experiment, there seemed to be an amount of
time that was typically required to locate a bug within a
subroutine once the correct subroutine had been identified
(approximately 16 minutes). Added to this baseline rate was
the time required to identify the proper subroutine. The
curvilinearity of the relationship between time to find the bug
and Halstead's E appeared to result from the time required to
isolate the problem subroutine.

The moderating effects of experiential factors also
replicated the results found in the earlier experiments. The
metrics again proved to be better predictors of performance for
programmers with three or fewer years of experience in Fortran
than for those with more than three years of experience. It was
also possible to predict the performance of an individual
programmer from job history data. Important factors seemed to
include the number of languages a programmer had used and
familiarity with certain programming concepts. These predictions
from job history were also more valid for programmers who had
three or fewer years of experience in Fortran. Future work
is needed to refine the use of experiential questionnaires for
use in personnel functions such as selection, assessment for
training needs, and placement.

Code  which  is  more  psychologically  complex  may  also  be 
more  error-prone  and  difficult  to  test.  The  results  of  this 
experiment  provide  evidence  that  the  software  complexity  metrics 
developed  by  Halstead  and  McCabe  are  related  to  the  difficulty 
programmers experience in locating errors in code. Thus these
metrics  appear  to  be  capable  of  satisfying  several  practical 
applications.  They  can  be  used  in  providing  feedback  both  to 
programmers  about  the  complexity  of  the  code  they  have  developed 
and  to  managers  about  the  resources  that  will  be  necessary  to 
maintain  particular  sections  of  code.  Further  evaluative 
research  needs  to  assess  the  validity  of  these  uses  in  ongoing 
software  projects. 
Instructions To Participants


HELLO 

Today  we  are  going  to  ask  you  to  participate  in  an  experiment 
which  we  hope  will  be  both  entertaining  and  challenging.  This  study  is 
being  sponsored  by  GE  and  the  Office  of  Naval  Research  to  examine  the 
properties  of  bugs  in  computer  programs.  To  accomplish  this,  we  will 
give  you  several  different  programs  and  ask  you  to  find  a bug  in  each 
one.  Our  purpose  is  to  evaluate  characteristics  of  programs  which  make 
them  easier  to  debug.  It  is  not  to  evaluate  computer  programmers.  Your 
performance  on  a program  will  be  compared  only  to  your  performance  on 
other  programs,  and  no  form  of  competition  is  involved.  We  hope  you  will 
assist us in what we believe is important research in software
engineering. However, your involvement is voluntary and you are free to
withdraw from participation at any time. All programs and papers that you
will  be  handed  are  carefully  numbered  so  it  is  not  necessary  for  you  to 
put  your  name  on  any  of  these.  These  numbers  are  solely  for  the  purpose 
of  identifying  different  programs  and  cannot  be  used  to  identify  you 
as  an  individual.  Your  work  will  remain  completely  anonymous  and  data 
collected  in  this  study  will  be  used  for  research  purposes  only. 

For  each  task,  you  will  be  given  a program,  the  input  files, 
and  both  the  correct  and  incorrect  output  produced  by  the  program  you 
have.  Your  job  is  to  identify  the  bug  and  correct  it.  Each  bug  can  be 
corrected  by  inserting,  deleting  or  correcting  one  line  of  code.  When 
you  believe  you  have  corrected  the  bug,  please  inform  the  monitor  by 
raising  your  hand. 

During this experiment, each of you will be working on a
different program. If others seem to finish earlier than you, don't be
concerned.  They  may  have  been  working  on  a program  which  did  not  require 
as  much  time. 

We  will  begin  with  a short  introductory  program.  Raise  your 
hand  as  soon  as  you  have  found  the  bug  and  corrected  it.  Because  of 
the  concentration  required  for  this  task,  we  ask  you  to  make  an  extra 
effort  to  remain  quiet  so  that  others  will  not  be  distracted.  When  you 
have  completed  all  three  experimental  programs  you  are  free  to  leave, 
but  please  do  not  discuss  any  of  the  programs  you  worked  on  with  anyone 
else  until  after  we  have  completed  all  experimental  sessions.  We  request 
this  of  you  only  to  insure  that  our  results  are  valid. 

If  there  are  any  questions,  please  ask  them  at  this  time. 


FORTRAN 77


One of the programs you see will be in Fortran-77. It is very
similar to standard Fortran except for the addition of three
constructs.

F77 allows:

1. if:            IF (condition) THEN
                     any statement
                     .
                  ENDIF

   or:            IF (condition) THEN
                     any statement
                  ELSE
                     any statement
                     .
                  ENDIF

2. do while:      DO statement # WHILE (condition)
                     any statement
   statement #

3. repeat until:  DO statement # UNTIL (condition)
                     any statement
   statement #


Miscellaneous:

   input and output files may be referenced by a string
   spaces are not important
   line lengths are not important
   & after the line number indicates a continuation line
   Fortran 77 IF's can be nested
   program order will be the following:
      input
      program
      correct output
      incorrect output with bad results circled


APPENDIX  B 
PRETESTS 
Greatest Common Divisor Algorithm


SOURCE CODE LISTING

110       INTEGER GCD, REMAIN
115       I=0
120       READ("EUCDAT",1) M,N
130     1 FORMAT(2I5)
140       IF (M.EQ.0) THEN
150         GCD=N
160       ELSE
170         IF (N.EQ.0) THEN
180           GCD=M
190         ELSE
200           IG=M/N
210           REMAIN=M-N*IG
220           DO 2 WHILE (REMAIN.NE.0 .AND. I.LT.100)
230             M=N
240             M=REMAIN
250             IG=M/N
260             REMAIN=M-N*IG
265             I=I+1
270     2     CONTINUE
275           GCD=N
280         END IF
290       END IF
294       IF (I.LT.100) THEN
295         PRINT 3,GCD
296     3   FORMAT(" GCD = ",I5)
297       ELSE
298         PRINT 4
299     4   FORMAT(" TOO MANY ITERATIONS")
300       ENDIF
301       STOP
310       END

INPUT

EUCDAT


Sorting Algorithm


INPUT

DATAPRE


110

30 

31 
1 

153 

193 

62 

78 

16 

1 

193 

62 

78 

74 

168 

192 

199 

999 

5 

78 

79 

56 
9 

57 
3 


100       IMPLICIT INTEGER(A-Z)
110       DIMENSION A(50),B(50)
115       READ("DATAPRE",10) N
116       DO 5 I = 1, N
120     5 READ("DATAPRE",10) A(I)
130    10 FORMAT(I3)
140       DO 100 J = 1, M
160       SMALL = A(1)
170       M = 1
180       DO 20 K = 2,N
190    15 IF(A(K).LT.SMALL) GO
200       SMALL = A(K)
210       M = K
220    20 CONTINUE
230       B(J) = SMALL
240       A(M) = 1000
250   100 CONTINUE
251       DO 101 I = 1, N
260   101 PRINT 110, B(I)
261   110 FORMAT(2X,I4)
270       STOP
280       END

CORRECT      INCORRECT
OUTPUT       OUTPUT

    1           999
    1          1000
    3          1000
    5          1000
    9          1000
   16          1000
   30          1000
   31          1000
   56          1000
   57          1000
   62          1000
   62          1000
   74          1000
   78          1000
   78          1000
   78          1000
   79          1000
  110          1000
  153          1000
  168          1000
  192          1000
  193          1000
  193          1000
  199          1000
  999          1000
APPENDIX C

EXPERIENCE  QUESTIONNAIRE 


SUMMARY 


QUESTIONNAIRE 


We  would  like  you  to  answer  the  following  questions  for  our  research  purposes: 

1.  How long have you been programming in FORTRAN professionally?

    years    months

2.  Please circle one of the following: Has your experience primarily
    been with

    a.  Engineering

    b.  Statistical

    c.  Non-Numeric

    d.  Business

    e.  Other (Please describe)

Also, please briefly describe your specific areas of programming experience.


3a. Approximately how many source code instructions were in the longest
    FORTRAN program that you have ever written? Please exclude blank lines
    and comments.

 b. What is the length of the longest non-FORTRAN program you have ever
    written?

    what language?

4.  Place a check in the appropriate blank for each of the following languages
    you have used:

    a.  FORTRAN

    b.  FORTRAN 77

    c.  COBOL

    d.  PL/1

    e.  BASIC

    f.  PASCAL

    g.  APL

    h.  ALGOL

    i.  JOVIAL

    j.  assembler

    k.  RPG

    l.  SNOBOL

    m.  LISP

    n.  other


5.  What  was  the  first  programming  language  you  learned? 

6.  Place  a check  in  the  appropriate  blank  for  each  of  the  following  you 
have  used  when  coding: 


DO  statement 
arrays 

CALL  with  parameters 
COMMON 

READ  statement 
PRINT  statement 
WRITE  statement 
FORMAT  statement 
'X' format specification
'A' format specification
'I' format specification
'F' format specification
continuation lines
'H' format specification
implicit data types
IF THEN ELSE (concept)

CREDITS in monetary trans.
DEBITS in monetary transactions
Financial transactions
TRIAL BALANCE computation
GENERAL LEDGER Accounting
REAL NOTATION (0.31)

Tax computation

carriage control Hollerith

2 or  more  dimensional  arrays 

using  " in  output  formats 

IMPLICIT  statement 

HEAP  sorts 

stacks 

tree  search 

NAMELIST  statement 

'T'  format  specification 

interrupt  handlers 

parsers 

lexical  analyzers 

graphics  drivers  and  handlers 


DATA  statement 

conversion  from  alpha  to 
string  variables 

IF  of  more  than  1 condition 
decimal  to  integer  conversions 
percentile  computation 
DO  WHILE  (concept) 

DO  UNTIL  (concept) 
weighting  numbers 

rounding  numbers  when  don't 
have  rounding  function 

used  an  array  reference  as  an 
index  to  another  array 

finding  Maximum  value  in  array 

finding  mean  of  values 
in  an  array 

printing  titles  in  an  output 

computing  frequencies  of  items 

running  SUMS 

Bubble  SORT 

implied  DO 

equivalenced  arrays 

String  variables 

used the binary equivalent
of characters

Interactive debugger
symbolic debugger
TRACE mechanism
Octal or Hex dumps
Double precision
free field I/O
matrix  inversion 
pattern  matching 
device  drivers 
batch  systems 
interactive  systems 
list  handling  languages 
7.  Please indicate in the space provided any other particulars which
you feel may have an effect on your performance (for instance, if
most of your work is involved in debugging systems we would like to
know that).


8.  Please  indicate  your  reactions  to  the  experiment  and  anything  that 
might  help  us  understand  how  you  undertook  the  task.  Please  include 
any  problems  or  insights  you  may  have  had. 
Questionnaire Scoring Program


      SUBROUTINE SCORE(LX, MSEX, L)
      INTEGER MSEX(100,2), CAT(100), RLENG(100)
      ALPHA LX(100,3), ROOT(100,3), KA, KB, KC
      DIMENSION F(25,2)
      READ("DATC12",4) NR, NC
    4 FORMAT(2I3)
    5 FORMAT(I2,1X,3A4)
      DO 10 I = 1,NR
        READ("DATC12",5) CAT(I), (ROOT(I,J), J = 1,3)
   10 CONTINUE
      DO 20 I = 1,NR
        RLENG(I) = LGTH(ROOT(I,1), ROOT(I,2), ROOT(I,3))
   20 CONTINUE
      PRINT 1
      PRINT 6
    1 FORMAT(///17H0ECHO INPUT ROOTS)
    6 FORMAT(1H0,11X,5HROOTS,3X,8HCATEGORY)
    7 FORMAT(10X,3A4,I10)
      DO 30 I = 1,NR
        PRINT 7, (ROOT(I,J), J = 1,3), CAT(I)
   30 CONTINUE
      DO 40 I = 1,NC
        F(I,1) = 0.0
        F(I,2) = 0.0
   40 CONTINUE
      DO 60 I = 1,L
        KA = LX(I,1)
        KB = LX(I,2)
        KC = LX(I,3)
        LL = LGTH(KA, KB, KC)
        CALL ROOTER(KINDEX, ROOT, RLENG, NR, KA, KB, KC, LL)
        IF (KINDEX .NE. 0) THEN
          J = CAT(KINDEX)
          DO 50 II = 1,2
            F(J,II) = F(J,II) + MSEX(I,II)
   50     CONTINUE
        ENDIF
   60 CONTINUE
      PRINT 3
    3 FORMAT(///10X,31HCATEGORY   TOTAL   MALE  FEMALE)
      DO 90 I = 1,NC
        T = F(I,1) + F(I,2)
        PRINT 8, I, T, F(I,1), F(I,2)
   90 CONTINUE
    8 FORMAT(13X,I1,2(4X,F5.0),2X,F5.0)
      RETURN
      END


NOTE:  The  program  is  correct  as  printed.  Handwritten  changes 

indicate  the  errors  the  participants  saw. 
Accounting Program


      SUBROUTINE TRBAL
      COMMON IACCT1(100),IACCT2(100),IACCT3(100),IACCT4(100),BAL(100),N
      PRINT 400
  400 FORMAT (1H0,20X,25H***** TRIAL BALANCE *****///)
      PRINT 410
  410 FORMAT (1H ,5HACCT.,28X,5HDEBIT,9X,6HCREDIT)
      SDEBIT=0.0
      SCRD=0.0
      PRINT 420
  420 FORMAT (1H ,70(1H-))
      DO 480 I=1,N
      IF (BAL(I) .EQ. 0.0) GO TO 480
      IF (I .GT. 20) GO TO 450
      IF (I .EQ. 4 .OR. I .EQ. 13 .OR.
     &    I .EQ. 15) GO TO 460
  430 SDEBIT=SDEBIT+BAL(I)
      PRINT 440, I,IACCT1(I),IACCT2(I),IACCT3(I),IACCT4(I),BAL(I)
  440 FORMAT (1H ,I3,2X,4A4,5X,F12.2)
      GO TO 480
  450 CONTINUE
      IF (I .GT. 60) GO TO 430
      IF (I .EQ. 53) GO TO 430
  460 SCRD=SCRD+BAL(I)
      PRINT 470, I,IACCT1(I),IACCT2(I),IACCT3(I),IACCT4(I),BAL(I)
  470 FORMAT (1H ,I3,2X,4A4,20X,F12.2)
  480 CONTINUE
      PRINT 420
      PRINT 490, SDEBIT,SCRD
  490 FORMAT (1H ,26X,F12.2,3X,F12.2)
      PRINT 420
      PRINT 420
      RETURN
      END
Grading Program


      SUBROUTINE GRADR2(SCORE,PGRADE,FREQ,HIGH,PERCNT,PT,TOTAL)
      IMPLICIT INTEGER(A-Z)
      COMMON NSTUDN,NASSGN,ID,CURID
      DIMENSION SCORE(300,20),PERCNT(5),PGRADE(5),LB(5),
     &          TOTAL(100),FREQ(100),ID(300)
      PRINT 680
  560 FORMAT (1H ,///)
  680 FORMAT (1H0,33X,"OVERALL SCORE",3X,"FREQUENCY")
      I=HIGH
  690 IF (FREQ(I) .NE. 0) PRINT 700, I,FREQ(I)
  700 FORMAT (1H0,33X,I3,10X,I3)
      I=I-1
      IF (I .GT. 0) GO TO 690
      PRINT 560
      PRINT 710
  710 FORMAT (1H0,30X,"LOWER BOUNDS FOR EACH GRADE")
      SUM=FREQ(HIGH)
      CUT=HIGH
      DO 760 PT=1,4
  720 QUOTA=IFIX(FLOAT(SUM)/FLOAT(NSTUDN)*100+.5)
      IF (QUOTA .GE. PERCNT(PT)) GO TO 740
  730 CUT=CUT-1
      IF (CUT .LT. 1) GO TO 770
      IF (FREQ(CUT) .LT. 1) GO TO 730
      SUM=SUM+FREQ(CUT)
      GO TO 720
  740 LB(PT)=CUT
      PRINT 750, PGRADE(PT),LB(PT)
  750 FORMAT (1H0,41X,A1,2X,I3)
      SUM=0
  760 CONTINUE
      PT=PT+1
      GO TO 790
  770 DO 780 I=PT,4
  780 LB(PT)=0
  790 LB(5)=0
      PRINT 750, PGRADE(PT),LB(PT)
      PRINT 560
      PRINT 800
  800 FORMAT (1H0,36X,"FINAL COURSE GRADE")
      DO 850 I=1,NSTUDN
      PRINT 810, ID(I),(SCORE(I,J),J=1,NASSGN)
  810 FORMAT (1H0,31X,I8,20(1X,I3))
      DO 820 J=1,5
      IF (TOTAL(I) .GE. LB(J)) GO TO 830
  820 CONTINUE
  830 PRINT 840, TOTAL(I),PGRADE(J)
  840 FORMAT (1H0,32X,"OVERALL SCORE = ",I3," GRADE = ",A1)
  850 CONTINUE
      GTOTAL=0
      DO 860 I=1,NSTUDN
  860 GTOTAL=GTOTAL+TOTAL(I)
      MEAN=IFIX(FLOAT(GTOTAL)/FLOAT(NSTUDN)+.5)
      PRINT 870, MEAN
  870 FORMAT (1H0,31X,"MEAN SCORE = ",I3)
      RETURN
      END